A household robot should be able to navigate to target locations without requiring users to first annotate everything in their home. Current approaches to this object-navigation challenge do not test on real robots and rely on expensive semantically labeled 3D meshes. In this work, our aim is an agent that builds self-supervised models of the world via exploration, much as a child might. We propose an end-to-end self-supervised embodied agent that leverages exploration to train a semantic segmentation model of 3D objects, and uses those representations to learn an object navigation policy purely from self-labeled 3D meshes. The key insight is that embodied agents can leverage location consistency as a supervision signal: collecting images of the same location from different viewpoints and applying contrastive learning to fine-tune a semantic segmentation model. In our experiments, we observe that our framework performs better than other self-supervised baselines and competitively with supervised baselines, both in simulation and when deployed in real houses.
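As a rough illustration of the location-consistency idea, the sketch below treats image crops that project to the same 3D location as positive pairs in an InfoNCE-style contrastive loss. The function names, the toy embeddings, and the use of plain cosine similarity are illustrative assumptions, not the paper's implementation.

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(a * a for a in v))
    return dot / (nu * nv)

def location_contrastive_loss(embeddings, locations, temperature=0.1):
    """InfoNCE-style loss in which crops observed at the same 3D
    location (here: identical location ids) are treated as positives
    and everything else as negatives."""
    losses = []
    n = len(embeddings)
    for i in range(n):
        positives = [j for j in range(n) if j != i and locations[j] == locations[i]]
        if not positives:
            continue
        # Exponentiated similarities to all other samples (the denominator).
        sims = [math.exp(cosine(embeddings[i], embeddings[j]) / temperature)
                for j in range(n) if j != i]
        denom = sum(sims)
        for j in positives:
            idx = j if j < i else j - 1  # index into sims, which skips i
            losses.append(-math.log(sims[idx] / denom))
    return sum(losses) / len(losses)
```

Fine-tuning the segmentation backbone would then amount to minimizing this loss over crops gathered during exploration.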
Perception systems in personalized mobile agents require the development of indoor scene understanding models that can understand 3D geometry, capture objectness, analyze human behaviors, and more. However, this direction has not been as well explored as models for outdoor environments (e.g., autonomous driving systems, including pedestrian prediction, car detection, and traffic sign recognition). In this paper, we first discuss the main challenge: insufficient, or even no, labeled data for real-world indoor environments, along with other challenges such as fusion between heterogeneous sources of information (e.g., RGB images and LiDAR point clouds), modeling relationships between a diverse set of outputs (e.g., 3D object locations, depth estimation, and human poses), and computational efficiency. We then describe MMISM (Multi-modal Input Multi-task output Indoor Scene understanding Model) to tackle the above challenges. MMISM considers RGB images as well as sparse LiDAR points as inputs, and 3D object detection, depth completion, human pose estimation, and semantic segmentation as output tasks. We show that MMISM performs on par with, or even better than, single-task models; e.g., we improve the baseline 3D object detection result by 11.7% on the benchmark ARKitScenes dataset.
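A minimal sketch of the multi-modal-input / multi-task-output structure the abstract describes: one shared fusion step over RGB and LiDAR features feeding several task heads (detection, depth, pose, and segmentation in the paper; arbitrary callables here). The naive mean fusion and the toy feature vectors are placeholders for illustration only.

```python
def fuse(rgb_feat, lidar_feat):
    # Naive fusion: element-wise mean of the two modality features.
    return [(r + l) / 2.0 for r, l in zip(rgb_feat, lidar_feat)]

class MultiTaskModel:
    """Toy multi-modal-input / multi-task-output skeleton: a shared
    fusion step feeds every task head, so the heads can benefit from
    jointly learned features."""

    def __init__(self, heads):
        self.heads = heads  # dict: task name -> callable on fused features

    def forward(self, rgb_feat, lidar_feat):
        shared = fuse(rgb_feat, lidar_feat)
        return {name: head(shared) for name, head in self.heads.items()}
```

In a real system each head would be a learned network; the shared-trunk structure is the point of the sketch.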
Novel object captioning (NOC) aims to describe images containing objects without observing their ground-truth captions during training. Due to the absence of caption annotations, the captioning model cannot be directly optimized through sequence-to-sequence training or CIDEr optimization. As a result, we present Paraphrasing-to-Captioning (P2C), a two-stage learning framework for NOC that heuristically optimizes the output captions via paraphrasing. With P2C, the captioning model first learns paraphrasing from a language model pre-trained on text-only corpora, expanding the word bank to improve linguistic fluency. To further enforce output captions that sufficiently describe the visual content of the input image, we perform self-paraphrasing on the captioning model with introduced fidelity and adequacy objectives. Since no ground-truth captions are available for novel object images during training, our P2C leverages cross-modality (image-text) association modules to ensure the above caption properties are properly preserved. In the experiments, we not only show that our P2C achieves state-of-the-art performance on the nocaps and COCO Caption datasets, but also verify the effectiveness and flexibility of our learning framework by replacing the language and cross-modality association models for NOC. Implementation details and code are available in the supplementary materials.
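The fidelity/adequacy idea can be caricatured in a few lines: among candidate paraphrases of a draft caption, prefer ones that both keep the visual object words (adequacy) and score well under a language model (fluency). The word-overlap adequacy measure and the externally supplied fluency scores are toy assumptions, not P2C's actual cross-modal association modules.

```python
def adequacy(caption, objects):
    # Fraction of required object words mentioned in the caption.
    words = set(caption.lower().split())
    return sum(o in words for o in objects) / len(objects)

def select_paraphrase(paraphrases, objects, fluency_scores):
    """Toy stand-in for P2C's second stage: among paraphrases of a draft
    caption, pick the one maximizing adequacy (visual content preserved)
    plus fluency (here: scores assumed to come from a language model)."""
    def score(pair):
        cap, flu = pair
        return adequacy(cap, objects) + flu
    best, _ = max(zip(paraphrases, fluency_scores), key=score)
    return best
```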
Recent advances in self-supervised learning have made it possible to further reduce human intervention in multi-step pipelines where the focus evolves around a particular object of interest. In this paper, the focus is placed on the nuclei in histopathology images. In particular, we aim to extract cellular information in an unsupervised manner for a downstream task. As nuclei appear at various sizes, we propose a new scale-dependent convolutional layer to bypass scaling issues when resizing nuclei. On three nuclei datasets, we benchmark the following methods: handcrafted features, pre-trained ResNet, supervised ResNet, and self-supervised features. We show that the proposed convolutional layer boosts performance and that, combined with Barlow Twins, it encodes nuclei better than the supervised paradigm in the low-sample setting and outperforms all the other proposed unsupervised methods. In addition, we extend the existing TNBC dataset to incorporate nuclei class annotations, enriching it and publicly releasing a small-sample-setting dataset for nuclei segmentation and classification.
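One cheap way to make a receptive field scale-dependent without resizing the input is kernel dilation; the 1-D sketch below is only a loose analogue of the scale-dependent layer proposed in the paper, written to show how the same weights can cover differently sized structures.

```python
def dilated_conv1d(signal, kernel, dilation=1):
    """1-D convolution with a dilated kernel: the kernel taps are spread
    `dilation` samples apart, so larger dilations cover larger structures
    with the same weights (a rough analogue of scale dependence)."""
    k = len(kernel)
    span = (k - 1) * dilation  # extent of the dilated kernel minus one
    out = []
    for i in range(len(signal) - span):
        out.append(sum(kernel[j] * signal[i + j * dilation] for j in range(k)))
    return out
```

A scale-aware layer could, for instance, pick the dilation per sample from an estimate of the nucleus size.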
Self-supervised approaches for speech representation learning are challenged by three unique problems: (1) there are multiple sound units in each input utterance, (2) there is no lexicon of input sound units during the pre-training phase, and (3) sound units have variable lengths with no explicit segmentation. To deal with these three problems, we propose the Hidden-Unit BERT (HuBERT) approach for self-supervised speech representation learning, which utilizes an offline clustering step to provide aligned target labels for a BERT-like prediction loss. A key ingredient of our approach is applying the prediction loss over the masked regions only, which forces the model to learn a combined acoustic and language model over the continuous inputs. HuBERT relies primarily on the consistency of the unsupervised clustering step rather than the intrinsic quality of the assigned cluster labels. Starting with a simple k-means teacher of 100 clusters, and using two iterations of clustering, the HuBERT model either matches or improves upon the state-of-the-art wav2vec 2.0 performance on the Librispeech (960h) and Libri-light (60,000h) benchmarks with 10min, 1h, 10h, 100h, and 960h fine-tuning subsets. Using a 1B parameter model, HuBERT shows up to 19% and 13% relative WER reduction on the more challenging dev-other and test-other evaluation subsets.
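The two ingredients HuBERT's objective combines can be sketched in a few lines: an offline nearest-centroid (k-means) teacher that assigns a cluster label to each frame, and a cross-entropy loss computed over masked frames only. Real HuBERT predicts these labels from masked transformer features; the toy vectors and raw-score logits below are assumptions for illustration.

```python
import math

def assign_clusters(frames, centroids):
    # Offline k-means teacher: nearest-centroid label per frame.
    def d2(u, v):
        return sum((a - b) ** 2 for a, b in zip(u, v))
    return [min(range(len(centroids)), key=lambda c: d2(f, centroids[c]))
            for f in frames]

def masked_prediction_loss(logits, targets, mask):
    """Cross-entropy over masked frames only, as in HuBERT's BERT-like
    objective; unmasked frames contribute nothing to the loss."""
    losses = []
    for t, (scores, label) in enumerate(zip(logits, targets)):
        if t not in mask:
            continue
        z = [math.exp(s) for s in scores]  # softmax numerators
        losses.append(-math.log(z[label] / sum(z)))
    return sum(losses) / len(losses)
```

Iterated clustering would re-run the teacher on features from a partially trained model, refining the targets.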
Contrastive self-supervised learning (SSL) learns an embedding space that maps similar data pairs closer together and dissimilar data pairs farther apart. Despite its success, one issue has been overlooked: the fairness aspects of representations learned using contrastive SSL. Without mitigation, contrastive SSL techniques can incorporate sensitive information such as gender or race and cause unfair predictions on downstream tasks. In this paper, we propose a Conditional Contrastive Learning (CCL) approach to improve the fairness of contrastive SSL methods. Our approach samples positive and negative pairs from distributions conditioned on the sensitive attribute, or, empirically speaking, samples positive and negative pairs from the same gender or the same race. We show that our approach provably maximizes the conditional mutual information between the learned representations of positive pairs and reduces the effect of the sensitive attribute by taking it as the conditioning variable. On seven fairness and vision datasets, we empirically demonstrate that the proposed approach achieves state-of-the-art downstream performance compared to unsupervised baselines and significantly improves the fairness of contrastive SSL models on multiple fairness metrics.
Human language is often multimodal, comprising a mixture of natural language, facial gestures, and acoustic behaviors. However, two major challenges exist in modeling such multimodal human language time-series data: 1) inherent data non-alignment due to variable sampling rates for the sequences from each modality; and 2) long-range dependencies between elements across modalities. In this paper, we introduce the Multimodal Transformer (MulT) to generically address the above issues in an end-to-end manner without explicitly aligning the data. At the heart of our model is the directional pairwise crossmodal attention, which attends to interactions between multimodal sequences across distinct time steps and latently adapts streams from one modality to another. Comprehensive experiments on both aligned and non-aligned multimodal time-series show that our model outperforms state-of-the-art methods by a large margin. In addition, empirical analysis suggests that correlated crossmodal signals are able to be captured by the proposed crossmodal attention mechanism in MulT.
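The directional crossmodal attention at the heart of MulT can be sketched as ordinary dot-product attention in which one modality supplies the queries and the other supplies the keys and values, so no time alignment between the two sequences is assumed. The unbatched, single-head version below is a deliberate simplification.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def crossmodal_attention(queries, keys, values):
    """Directional crossmodal attention: modality A supplies queries,
    modality B supplies keys/values, so A's stream is latently adapted
    by B even when the two sequences have different lengths."""
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        dim = len(values[0])
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out
```

Because the output has one row per query, the A-to-B and B-to-A directions are distinct modules, hence "directional pairwise".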
Collision avoidance is key for mobile robots and agents to operate safely in the real world. In this work, we present an efficient and effective collision avoidance system that combines real-world reinforcement learning (RL), search-based online trajectory planning, and automatic emergency intervention, e.g., automatic emergency braking (AEB). The goal of the RL is to learn effective search heuristics that speed up the search for collision-free trajectories and reduce the frequency of triggering automatic emergency interventions. This novel setup enables RL to learn safely and directly on mobile robots in real-world indoor environments, minimizing actual crashes during training. Our real-world experiments show that, compared with multiple baselines, our approach enjoys a higher average speed, a lower crash rate, a higher goal-reaching rate, smaller computational overhead, and overall smoother control.
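The interplay between the learned heuristic and the emergency intervention can be sketched as a ranking-plus-fallback loop: score candidate trajectories with a heuristic (learned by RL in the paper, an arbitrary callable here), return the best collision-free one, and brake if none exists. The `AEB_STOP` sentinel is an illustrative assumption.

```python
def plan_with_fallback(candidates, is_collision_free, heuristic):
    """Search-based planner with an emergency fallback: keep only
    collision-free candidate trajectories, rank them by a heuristic
    (learned by RL in the paper), and trigger emergency braking when
    no safe candidate remains."""
    safe = [c for c in candidates if is_collision_free(c)]
    if not safe:
        return "AEB_STOP"  # hypothetical sentinel for the AEB intervention
    return max(safe, key=heuristic)
```

A better heuristic finds good safe trajectories earlier, which is exactly what reduces how often the fallback fires.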
We address the problem of unsupervised domain adaptation when the source domain differs from the target domain because of a shift in the distribution of a latent subgroup. When this subgroup confounds all observed data, neither covariate shift nor label shift assumptions apply. We show that the optimal target predictor can be non-parametrically identified with the help of concept and proxy variables available only in the source domain, and unlabeled data from the target. The identification results are constructive, immediately suggesting an algorithm for estimating the optimal predictor in the target. For continuous observations, when this algorithm becomes impractical, we propose a latent variable model specific to the data generation process at hand. We show how the approach degrades as the size of the shift changes, and verify that it outperforms both covariate and label shift adjustment.
Semi-supervised object detection is important for 3D scene understanding because obtaining large-scale 3D bounding box annotations on point clouds is time-consuming and labor-intensive. Existing semi-supervised methods usually employ teacher-student knowledge distillation together with an augmentation strategy to leverage unlabeled point clouds. However, these methods adopt global augmentation with scene-level transformations and hence are sub-optimal for instance-level object detection. In this work, we propose an object-level point augmentor (OPA) that performs local transformations for semi-supervised 3D object detection. In this way, the resultant augmentor is derived to emphasize object instances rather than irrelevant backgrounds, making the augmented data more useful for object detector training. Extensive experiments on the ScanNet and SUN RGB-D datasets show that the proposed OPA performs favorably against the state-of-the-art methods under various experimental settings. The source code will be available at https://github.com/nomiaro/OPA.
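Object-level augmentation, as opposed to scene-level, can be sketched as applying a transform only to the points that fall inside an object's bounding box, leaving the background untouched. The point and box representations below are placeholders for illustration, not OPA's learned augmentor.

```python
def augment_object_points(points, in_box, transform):
    """Object-level point augmentation: `transform` is applied only to
    points inside an object's box (as decided by `in_box`), so the
    augmentation emphasizes object instances over background."""
    return [transform(p) if in_box(p) else p for p in points]
```

A learned augmentor would choose `transform` per instance; the per-point gating is the part this sketch shows.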